Creating effective visualisations
Background
This document gives a brief introduction to creating effective visualisations and shows (some aspects of) how to customise ggplots. You don’t need to memorise all of this information: the document is intended as a reference and as a source of ideas about what is possible when creating plots.
Some of the following is adapted from a blogpost by Cédric Scherer under a Creative Commons Attribution 4.0 International licence.
For further details on colour scales, figure types and other considerations for creating effective visualisations, see Claus Wilke’s book Fundamentals of Data Visualization (2019, O’Reilly), which is also available online.
Load required packages
library(tidyverse)
library(ggsci)
library(gapminder)
Example data set (already discussed in the context of data import)
Student-to-teacher ratios in different parts of the world:
- data from the UNESCO Institute for Statistics
- made available via the Tidy Tuesday challenge
- preprocessed by Cédric Scherer – see blogpost
Read in data and inspect
st_ratios_full <- read_csv("student_teacher_ratios.csv")
head(st_ratios_full)
# A tibble: 6 × 20
indicator country count…¹ eduli…² year stude…³ flag_…⁴ flags name alpha.2
<chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 Primary Edu… Afghan… AFG PTRHC_1 2017 44.0 <NA> <NA> Afgh… AF
2 Primary Edu… Albania ALB PTRHC_1 2017 17.9 <NA> <NA> Alba… AL
3 Primary Edu… Algeria DZA PTRHC_1 2017 24.2 <NA> <NA> Alge… DZ
4 Primary Edu… Angola AGO PTRHC_1 2015 50.0 <NA> <NA> Ango… AO
5 Primary Edu… Antigu… ATG PTRHC_1 2017 12.1 <NA> <NA> Anti… AG
6 Primary Edu… Argent… ARG <NA> NA NA <NA> <NA> Arge… AR
# … with 10 more variables: alpha.3 <chr>, country.code <chr>,
# iso_3166.2 <chr>, region <chr>, sub.region <chr>, region.code <chr>,
# sub.region.code <chr>, x <dbl>, y <dbl>, student_ratio_region <dbl>, and
# abbreviated variable names ¹country_code, ²edulit_ind, ³student_ratio,
# ⁴flag_codes
glimpse(st_ratios_full)
Rows: 180
Columns: 20
$ indicator <chr> "Primary Education", "Primary Education", "Primar…
$ country <chr> "Afghanistan", "Albania", "Algeria", "Angola", "A…
$ country_code <chr> "AFG", "ALB", "DZA", "AGO", "ATG", "ARG", "ARM", …
$ edulit_ind <chr> "PTRHC_1", "PTRHC_1", "PTRHC_1", "PTRHC_1", "PTRH…
$ year <dbl> 2017, 2017, 2017, 2015, 2017, NA, NA, 2017, 2017,…
$ student_ratio <dbl> 44.00995, 17.94478, 24.22505, 50.02951, 12.05576,…
$ flag_codes <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ flags <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ name <chr> "Afghanistan", "Albania", "Algeria", "Angola", "A…
$ alpha.2 <chr> "AF", "AL", "DZ", "AO", "AG", "AR", "AM", "AT", "…
$ alpha.3 <chr> "AFG", "ALB", "DZA", "AGO", "ATG", "ARG", "ARM", …
$ country.code <chr> "004", "008", "012", "024", "028", "032", "051", …
$ iso_3166.2 <chr> "ISO 3166-2:AF", "ISO 3166-2:AL", "ISO 3166-2:DZ"…
$ region <chr> "Asia", "Europe", "Africa", "Africa", "North Amer…
$ sub.region <chr> "Southern Asia", "Southern Europe", "Northern Afr…
$ region.code <chr> "142", "150", "002", "002", "019", "019", "142", …
$ sub.region.code <chr> "034", "039", "015", "017", "029", "005", "145", …
$ x <dbl> 22, 15, 13, 13, 7, 6, 20, 15, 21, 4, 20, 23, 8, 1…
$ y <dbl> 8, 9, 11, 17, 4, 14, 6, 6, 7, 2, 9, 8, 6, 4, 5, 3…
$ student_ratio_region <dbl> 19.64278, 13.01069, 36.38758, 36.38758, 16.18269,…
Restrict to relevant columns:
st_ratios <- st_ratios_full |>
select(country,year,student_ratio,region)
st_ratios
# A tibble: 180 × 4
country year student_ratio region
<chr> <dbl> <dbl> <chr>
1 Afghanistan 2017 44.0 Asia
2 Albania 2017 17.9 Europe
3 Algeria 2017 24.2 Africa
4 Angola 2015 50.0 Africa
5 Antigua and Barbuda 2017 12.1 North America
6 Argentina NA NA South America
7 Armenia NA NA Asia
8 Austria 2017 10.0 Europe
9 Azerbaijan 2017 15.5 Asia
10 Bahamas 2016 19.0 North America
# … with 170 more rows
Customisation for a more effective visualisation 1: basics
The student-teacher ratio data provide an interesting example because we want to visualise both a measure of central tendency per region and the variability per region.
Let’s start with a bar / column graph. We start by summarising the data and then plot the graph. Can you recall why we need to summarise first before plotting in this way?
st_by_region <- st_ratios |>
group_by(region) |>
summarise(
median_st_ratio = median(student_ratio, na.rm=TRUE),
mean_st_ratio = mean(student_ratio, na.rm=TRUE),
sd_st_ratio = sd(student_ratio, na.rm=TRUE))
st_by_region |>
ggplot(aes(x = region, y = median_st_ratio)) +
geom_col()
To make the graph more readable, we should sort the columns into a meaningful order. Recall that we can use fct_reorder() to do this; it orders the levels of a categorical variable (region) by the values of another column (in this case, median_st_ratio).
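To see what fct_reorder() does in isolation, here is a minimal sketch on a small made-up data frame. The names toy, grp and val are hypothetical and not part of the UNESCO data. It compares the default (alphabetical) level order with the order after reordering by val, and shows the .desc argument for reversing the order.

```r
library(forcats)

# Hypothetical toy data for illustration only
toy <- data.frame(
  grp = factor(c("a", "b", "c")),
  val = c(30, 10, 20)
)

# Default: levels are in alphabetical order
levels(toy$grp)

# Levels ordered by ascending val
asc_levels <- levels(fct_reorder(toy$grp, toy$val))
asc_levels

# Levels ordered by descending val
desc_levels <- levels(fct_reorder(toy$grp, toy$val, .desc = TRUE))
desc_levels
```

Because ggplot draws a discrete axis in level order, changing the levels is all that is needed to change the order of the columns.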
st_by_region |>
ggplot(aes(x = fct_reorder(region,median_st_ratio),
y = median_st_ratio)) +
geom_col()
For graphs with long labels, flipping to a horizontal orientation improves readability. This is easily accomplished via coord_flip() (an alternative to switching x and y in the aesthetics).
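The two routes to a horizontal bar chart can be compared side by side. The sketch below uses a made-up data frame (toy, with hypothetical columns region and value) and assumes a reasonably recent ggplot2, in which geom_col() works out the horizontal orientation from the aesthetics.

```r
library(ggplot2)

# Hypothetical toy data for illustration only
toy <- data.frame(
  region = c("North", "South", "East"),
  value  = c(12, 30, 21)
)

# Route 1: build the plot vertically, then flip the coordinate system
p_flip <- ggplot(toy, aes(x = region, y = value)) +
  geom_col() +
  coord_flip()

# Route 2: map the categorical variable to y directly
p_swap <- ggplot(toy, aes(x = value, y = region)) +
  geom_col()

p_flip
p_swap
```

With the second route, labels and scales refer to the axes as they appear on screen, which can be easier to reason about.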
st_by_region |>
ggplot(aes(x = fct_reorder(region,median_st_ratio),
y = median_st_ratio)) +
geom_col() +
coord_flip()
In this case, we likely want to reorder so that the lowest value is at the top, given that this is the “best”. Note how we can do this by using a - just as in arrange(). We also add a title, remove the axis label for the regions and add a better label for the ratio axis. Note that, when using coord_flip(), we need to refer to the original (“unflipped”) axes, e.g. in our labels: the regions are still x and the ratio is still y.
st_by_region |>
ggplot(aes(x = fct_reorder(region,-median_st_ratio),
y = median_st_ratio)) +
geom_col() +
coord_flip() +
labs(
title = "Student-to-teacher ratios are lowest in Europe and highest in Africa",
y = "Median student-to-teacher ratio",
x = ""
)
Change the theme to further customise the plot and add some colour to the bars. Note that, if you want to change aspects of the plot that don’t involve mapping aspects of the data to aspects of the visualisation, these specifications don’t go into the aesthetics, as in the example below.
You can find all available colours in R using colours().
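As a small sketch of how the output of colours() can be explored, the base R code below counts the named colours, searches them for a pattern, and converts one name to its red, green and blue components. The search pattern "steelblue" is simply the colour used in the next plot.

```r
# Number of named colours known to R
length(colours())

# Named colours whose name contains "steelblue"
steel_blues <- grep("steelblue", colours(), value = TRUE)
steel_blues

# Red, green and blue components of a named colour
steel_rgb <- col2rgb("steelblue")
steel_rgb
```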
st_by_region |>
ggplot(aes(x = fct_reorder(region,-median_st_ratio),
y = median_st_ratio)) +
geom_col(fill = "steelblue", alpha=0.8) +
coord_flip() +
labs(
title = "Student-to-teacher ratios are lowest in Europe and highest in Africa",
y = "Median student-to-teacher ratio",
x = ""
) +
theme_minimal()
If we want a different colour per region, we can map fill to region and add a custom colour palette. To get rid of the legend, use guides(fill = "none"); this works for any aesthetic, so adapt it e.g. for colour as needed.
st_by_region |>
ggplot(aes(x = fct_reorder(region,-median_st_ratio),
y = median_st_ratio, fill = region)) +
geom_col(alpha = 0.8) +
labs(
title = "Student-to-teacher ratios are lowest in Europe and highest in Africa",
x = "",
y = "Median student-to-teacher ratio"
) +
coord_flip() +
theme_minimal() +
scale_fill_brewer(palette = "Dark2") +
guides(fill = "none")
For more information on colour palettes in R, see this website. Note that some palettes require you to install additional packages.
Notes on colour palettes
This section draws on Wilke (2019).
There are three fundamental uses for colour in visualisations:
- to distinguish groups of data
- to represent data values
- to highlight
Colour to distinguish groups
- Colour is useful to distinguish discrete (unordered) items or groups, e.g. factor levels.
- Use a qualitative colour scale for this:
- finite set of specific colours that are chosen to be distinct but also equivalent
- no one colour should stand out relative to the others
- the scale should not create the impression of an order
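To get a feel for qualitative scales, the following sketch lists the qualitative palettes that ship with the RColorBrewer package (which provides the palettes used by scale_fill_brewer()) and prints the hex codes of one of them. The choice of six colours is an assumption that matches the six regions in our data.

```r
library(RColorBrewer)

# brewer.pal.info describes every Brewer palette;
# the category "qual" marks the qualitative ones
qual_palettes <- brewer.pal.info[brewer.pal.info$category == "qual", ]
qual_palettes

# Hex codes of the first six colours of the Set2 palette
set2_cols <- brewer.pal(n = 6, name = "Set2")
set2_cols

# Draw all qualitative palettes side by side
display.brewer.all(type = "qual")
```

The colorblind column of brewer.pal.info indicates which palettes are considered colourblind-friendly.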
Replot our graph from above using the RColorBrewer Set2 palette:
library(colorblindr)
st_by_region |>
ggplot(aes(x = fct_reorder(region,-median_st_ratio),
y = median_st_ratio, fill = region)) +
geom_col(alpha = 0.8) +
labs(
title = "Student-to-teacher ratios are lowest in Europe and highest in Africa",
x = "",
y = "Median student-to-teacher ratio"
) +
coord_flip() +
theme_minimal() +
scale_fill_brewer(palette = "Set2") +
guides(fill = "none")
The Okabe Ito colour scale, a colourblind-friendly qualitative scale, is available via the colorblindr package (see the package website).
To install this package, we need to take a slightly different approach, namely installing from a GitHub repository. This is accomplished using install_github() from the devtools package. We also need to install the packages cowplot and colorspace. As in previous sessions, make sure to install packages via the console (you only need to do this once); don’t include the code below in your .qmd document or you will likely have trouble rendering it.
# install.packages("devtools")
# devtools::install_github("wilkelab/cowplot")
# install.packages("colorspace")
# devtools::install_github("clauswilke/colorblindr")
Replot our graph from above using the Okabe Ito scale:
library(colorblindr)
st_by_region |>
ggplot(aes(x = fct_reorder(region,-median_st_ratio),
y = median_st_ratio, fill = region)) +
geom_col(alpha = 0.8) +
labs(
title = "Student-to-teacher ratios are lowest in Europe and highest in Africa",
x = "",
y = "Median student-to-teacher ratio"
) +
coord_flip() +
theme_minimal() +
scale_fill_OkabeIto() +
guides(fill = "none")
Example for use of colour to distinguish groups
Note how we can also use fct_reorder() with mutate(), which means that we don’t need to include it in our ggplot() code. The final line of the chunk below shows how to increase text size in a plot.
gapminder |>
filter(year == 2007) |>
mutate(country = fct_reorder(country, pop)) |>
filter(pop > quantile(pop, probs = c(0.75))) |>
ggplot(aes(x = pop/1000000, y = country, fill = continent)) +
geom_col() +
scale_fill_brewer(palette = "Dark2") +
labs(
title = "Population in 2007",
subtitle = "Countries in top quartile for population",
x = "Population (millions)",
y = "Country",
fill = "Continent"
) +
theme_bw() +
theme(text = element_text(size = 12))
Using colour scales in ggplot
The following figure is from the ggplot2 cheatsheet
knitr::include_graphics("images/ggplot_scales.png")
Colour to represent values
We can use a sequential colour scale to represent quantitative values, i.e. a sequence of colours that:
- specifies which values are larger or smaller
- indicates distance between values
- is perceived to vary uniformly across entire range of values
- can be based on single hue or multiple hues
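As a minimal sketch of these properties, the code below fills a heat map with a multi-hue sequential scale (viridis) and with a single-hue one (Brewer "Blues"). It uses faithfuld, a data set that ships with ggplot2, purely for illustration; it is not part of the examples in this document.

```r
library(ggplot2)

# faithfuld: a grid of waiting and eruption times with an estimated
# density for each cell (bundled with ggplot2)
base_plot <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster() +
  labs(fill = "Density")

# Multi-hue sequential scale
p_viridis <- base_plot + scale_fill_viridis_c()

# Single-hue sequential scale; direction = 1 maps darker blues to higher values
p_blues <- base_plot + scale_fill_distiller(palette = "Blues", direction = 1)

p_viridis
p_blues
```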
Example for use of colour to represent values
- colours to represent values can be useful to show how values vary across geographic regions (a choropleth map)
- this code, which is adapted from this tutorial on drawing maps in R (https://www.r-spatial.org/r/2018/10/25/ggplot2-sf.html), uses scale_fill_viridis_c()
- note that the ggplot syntax for maps is a little different to what we have seen before: it uses geom_sf(); you don’t need to worry about the details of this unless you would like to use maps for your work
# note: to install multiple packages at once, use:
# install.packages(c("sf","rnaturalearth","rnaturalearthdata"))
# (rgeos has been archived on CRAN; recent versions of sf and
# rnaturalearth should work without it)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
world <- ne_countries(scale = "medium", returnclass = "sf")
world |>
ggplot(aes(fill = pop_est)) +
geom_sf() +
scale_fill_viridis_c(option = "plasma", trans = "sqrt") +
labs(
title = "World map with population estimate",
x = "Longitude",
y = "Latitude",
fill = "Population\nestimate"
)
Diverging scales
Use a diverging scale to visualise values diverging from a midpoint (e.g. dataset containing positive and negative numbers):
- essentially two sequential scales combined at a common midpoint (usually a light colour)
- scale needs to be balanced so that progression from light to dark is perceived similarly in both directions
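The idea of a shared midpoint can be sketched with scale_fill_gradient2(), which builds a diverging scale from a low, a mid and a high colour. The data frame below is made up (hypothetical columns x and change), and the three colours are an arbitrary choice for illustration.

```r
library(ggplot2)

# Hypothetical values that diverge from zero in both directions
toy <- data.frame(
  x      = 1:7,
  change = -3:3
)

p_div <- ggplot(toy, aes(x = x, y = 1, fill = change)) +
  geom_tile() +
  # midpoint = 0 anchors the light colour at zero, so that equally large
  # positive and negative values get equally dark colours
  scale_fill_gradient2(
    low = "#C51B7D", mid = "white", high = "#4D9221",
    midpoint = 0
  ) +
  labs(fill = "Change")

p_div
```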
Example for use of a diverging scale
- this plot uses scale_fill_brewer() with palette PiYG
world |>
ggplot(aes(fill = income_grp)) +
geom_sf() +
scale_fill_brewer(palette = "PiYG") +
labs(
title = "Income group",
x = "Longitude",
y = "Latitude",
fill = "Income group"
)
Colour to highlight
Colour can be used to highlight specific elements, e.g. particular categories or values to emphasise.
- Use an accent colour scale
- subdued colours
- matching set of stronger, darker or more saturated colours
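One way to build such an accent scale by hand, without an additional package, is to flag the groups to emphasise and then assign a strong colour to the flagged groups and a subdued one to the rest. The sketch below does this for the gapminder data; the column name highlighted and the two colours are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

life_by_continent <- gapminder |>
  group_by(continent, year) |>
  summarise(lifeExp = mean(lifeExp), .groups = "drop") |>
  # Flag the continents we want to emphasise
  mutate(highlighted = continent %in% c("Oceania", "Asia"))

p_highlight <- life_by_continent |>
  ggplot(aes(x = year, y = lifeExp, group = continent, colour = highlighted)) +
  geom_line(linewidth = 1.5) +
  # Strong colour for highlighted lines, subdued grey for all others
  scale_colour_manual(values = c("TRUE" = "darkorange", "FALSE" = "grey80")) +
  guides(colour = "none") +
  theme_bw()

p_highlight
```

This variant uses a single accent colour for all highlighted groups; mapping colour to the group itself would need one accent colour per highlighted group.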
Example for use of colour to highlight
- this plot uses scale_colour_brewer() with palette Dark2 and the gghighlight package
library(gghighlight)
gapminder |>
group_by(continent,year) |>
summarise(lifeExp = mean(lifeExp)) |>
ggplot(aes(x = year, y = lifeExp, colour = continent)) +
geom_line(linewidth = 1.5) +
gghighlight(continent %in% c("Oceania", "Asia")) +
theme_bw() +
scale_colour_brewer(palette = "Dark2") +
theme(text = element_text(size = 20)) +
labs(
title = "Life expectancy by year and continent",
x = "Year",
y = "Life expectancy",
colour = "Continent"
)
Example: Brewer colour scales
- from ?scale_colour_brewer
Customisation for a more effective visualisation 2: adding variability information
Back to our example … Note that much of the following is based on Cédric Scherer’s blog (see link above).
What if we wanted to add variability information?
We could use a boxplot:
st_ratios |>
ggplot(aes(x = region, y = student_ratio)) +
geom_boxplot()
Let’s start by making some modifications to make the plot more readable, in line with the above considerations. To be able to order the boxplot, we need to add a variable with a regional summary statistic; we will use the median here. Note the use of mutate() in conjunction with group_by() to create a new variable with group-based values.
Note also the use of a custom colour scale from the University of Chicago via the ggsci package and how we customise the limits for the ratio axis (y in the code, drawn horizontally because of coord_flip()).
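The difference between summarise() and mutate() after group_by() can be seen in a minimal sketch on a made-up tibble (toy, with hypothetical columns grp and val): summarise() collapses each group to one row, whereas mutate() keeps every row and repeats the group value.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(
  grp = c("a", "a", "b", "b"),
  val = c(1, 3, 10, 20)
)

# One row per group
summarised <- toy |>
  group_by(grp) |>
  summarise(med = median(val))
summarised

# One row per observation, with the group median attached to each
mutated <- toy |>
  group_by(grp) |>
  mutate(med = median(val)) |>
  ungroup()
mutated
```

Keeping every row is what allows us to draw the individual observations and still order the regions by a group-level statistic.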
st_ordered <- st_ratios |>
group_by(region) |>
mutate(st_by_region = median(student_ratio, na.rm=TRUE)) |>
ungroup() |>
mutate(region = fct_reorder(region, -st_by_region))
st_ordered |>
ggplot(aes(x = region, y = student_ratio, fill = region)) +
geom_boxplot() +
theme_light() +
scale_y_continuous(limits = c(0, 90)) +
coord_flip() +
scale_fill_uchicago() +
labs(
title = "Student-teacher ratios are highest and most variable in Africa",
x = "",
y = "Student-to-teacher ratio"
) +
guides(fill = "none")
Now add some other adjustments … Note how we can create a new object g with the basic setup of our plot, to which we can subsequently add various geometries:
library(showtext)
font_add_google("Poppins", "Poppins")
font_add_google("Roboto Mono", "Roboto Mono")
showtext_auto()
theme_set(theme_light(base_size = 18, base_family = "Poppins"))
g <-
st_ordered |>
ggplot(aes(x = region, y = student_ratio, colour = region)) +
coord_flip() +
scale_y_continuous(limits = c(0, 90), expand = c(0.005, 0.005)) +
scale_color_uchicago() +
labs(x = NULL, y = "Student to teacher ratio") +
theme(
legend.position = "none",
axis.title = element_text(size = 16),
axis.text.x = element_text(family = "Roboto Mono", size = 12),
panel.grid = element_blank()
)
Look at varying geometries:
g + geom_boxplot()
g + geom_violin()
g + geom_line(linewidth = 1)
g + geom_point(size = 1)
g + geom_jitter(size = 1)
We can also combine geoms for more information! Note that the alpha parameter controls transparency: it ranges from 0 (fully transparent) to 1 (fully opaque). In geom_boxplot(), the outlier.alpha parameter controls the transparency of the outlier points: by setting it to 0 we effectively hide them and ensure that they don’t overlap with the points drawn by geom_jitter().
g + geom_boxplot(outlier.alpha = 0) +
geom_jitter(size = 2, alpha = 0.3)
For a more intuitive visualisation, we use geom_jitter() combined with the mean per region. We can get ggplot to compute this for us using stat_summary(). With set.seed(), we ensure that the “random” dot distribution produced by geom_jitter() is reproducible; this will become important when adding labels later.
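The effect of set.seed() can be sketched without a plot. The code below uses base R’s jitter() as a stand-in for the random offsets that geom_jitter() adds; this is an illustration of seeding in general, not of ggplot2’s internals. The seed value is the one used in the plots that follow.

```r
# Same seed, same "random" offsets
set.seed(2019)
first <- jitter(rep(1, 5))

set.seed(2019)
second <- jitter(rep(1, 5))

identical(first, second)

# Without resetting the seed, the next draw differs
third <- jitter(rep(1, 5))
identical(first, third)
```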
set.seed(2019)
g +
geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
stat_summary(fun = mean, geom = "point", size = 5)
Add a line for the world average:
world_avg <-
st_ordered |>
summarise(avg = mean(student_ratio, na.rm = TRUE)) |>
# pull ensures that we have only a single value not a data frame
pull(avg)
g +
geom_hline(aes(yintercept = world_avg), colour="grey70", linewidth=0.6) +
geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
stat_summary(fun = mean, geom = "point", size = 5)
Now add lines from the world average to the regional averages:
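# st_by_region holds the regional median, so compute the regional means
# to make the segments end at the points drawn by stat_summary()
region_avgs <- st_ordered |>
group_by(region) |>
summarise(region_avg = mean(student_ratio, na.rm = TRUE))
g +
geom_segment(
data = region_avgs,
aes(x = region, xend = region,
y = world_avg, yend = region_avg),
linewidth = 0.8
) +
geom_hline(aes(yintercept = world_avg), colour="grey70", linewidth=0.6) +
geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
stat_summary(fun = mean, geom = "point", size = 5)
And finally … add some text: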
# regional means, so that the segments end at the points drawn by stat_summary()
region_avgs <- st_ordered |>
group_by(region) |>
summarise(region_avg = mean(student_ratio, na.rm = TRUE))
g +
geom_segment(
data = region_avgs,
aes(x = region, xend = region,
y = world_avg, yend = region_avg),
linewidth = 0.8
) +
geom_hline(aes(yintercept = world_avg), colour="grey70", linewidth=0.6) +
geom_jitter(size = 2, alpha = 0.25, width = 0.2) +
stat_summary(fun = mean, geom = "point", size = 5) +
annotate(
"text", x = 6.3, y = 35, family = "Poppins", size = 2.8,
color = "gray20", lineheight = .9,
label = glue::glue("Worldwide average:\n{round(world_avg, 1)} students per teacher")
)
See the blogpost for further details on how to add more text as well as arrows.
Customisation for a more effective visualisation 3: an alternative way to visualise distributions
We could also plot density ridges!
For an introduction to geom_density_ridges() from the ggridges package, see https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html. Density ridges expand on the basic geom_density() in ggplot.
Also note the addition of a caption to acknowledge the source of the data.
library(ggridges)
st_ordered |>
ggplot(aes(x = student_ratio, y = region, fill = region, colour = region)) +
geom_density_ridges(alpha = 0.5, rel_min_height = 0.001, scale = 0.9, quantile_lines = TRUE, quantiles = 2) +
scale_color_uchicago() +
scale_fill_uchicago() +
labs(
y = NULL,
x = "Student to teacher ratio",
caption = "Source: UNESCO") +
theme(
legend.position = "none",
axis.title = element_text(size = 16),
axis.text.x = element_text(family = "Roboto Mono", size = 12),
panel.grid = element_blank()
)